Combining phonological and acoustic ASR-free features for pathological speech intelligibility assessment
Intelligibility is widely used to measure the severity of articulatory problems in pathological speech. Recently, a number of automatic intelligibility assessment tools have been developed. Most of them use automatic speech recognizers (ASR) to compare the patient's utterance with the target text. These methods are bound to one language and tend to be less accurate when speakers hesitate or make reading errors. To circumvent these problems, two different ASR-free methods were developed over the last few years, making use only of the acoustic or phonological properties of the utterance. In this paper, we demonstrate that these ASR-free techniques are also able to predict intelligibility in other languages. Moreover, they prove to be complementary, resulting in even better intelligibility predictions when both methods are combined.
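The combination of the two ASR-free predictors can be illustrated with a simple late-fusion sketch. The weighted average below is an assumption for illustration; the paper's actual combination scheme may differ.

```python
def fuse_predictions(acoustic_score, phonological_score, weight=0.5):
    """Linear late fusion of two ASR-free intelligibility estimates.

    weight: contribution of the acoustic predictor (0..1). The fusion
    rule and the 0..100 score scale are illustrative assumptions.
    """
    return weight * acoustic_score + (1.0 - weight) * phonological_score

# Example: two predictors rate the same utterance on a 0..100 scale.
fused = fuse_predictions(72.0, 80.0, weight=0.25)
print(fused)  # 78.0
```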
Detection of persons with Parkinson's disease by acoustic, vocal, and prosodic analysis
70 % to 90 % of patients with Parkinson's disease (PD) show an affected voice. Various studies have revealed that voice and prosody are among the earliest indicators of PD. The aim of this study is to automatically detect whether a person's speech/voice is affected by PD. We employ acoustic features, prosodic features, and features derived from a two-mass model of the vocal folds on different kinds of speech tests: sustained phonations, syllable repetitions, read texts, and monologues. Classification is performed in each case by SVMs. A correlation-based feature selection was performed in order to identify the most important features for each of these systems. We report a recognition rate of 91 % with prosodic modeling when differentiating between normally speaking persons and speakers with PD in early stages. With acoustic modeling we achieved a recognition rate of 88 %, and with vocal modeling 79 %. After feature selection these results could be greatly improved, but we expect those results to be too optimistic. We show that read texts and monologues are the most meaningful tests when it comes to the automatic detection of PD based on articulation, voice, and prosodic evaluations. The most important prosodic features were based on energy, pauses, and F0. The masses and the spring compliances were found to be the most important parameters of the two-mass vocal fold model.
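The correlation-based selection step can be sketched as follows. This is a simplification that ranks features only by their correlation with the class label; full correlation-based feature selection also penalises inter-feature redundancy, and the toy data are invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def select_features(feature_matrix, labels, k):
    """Keep the indices of the k features most correlated (in absolute
    value) with the class label."""
    scores = []
    for j in range(len(feature_matrix[0])):
        column = [row[j] for row in feature_matrix]
        scores.append((abs(pearson(column, labels)), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]

# Toy data: feature 0 tracks the label, feature 1 is a constant.
X = [[1.0, 5.0], [2.0, 5.0], [8.0, 5.0], [9.0, 5.0]]
y = [0, 0, 1, 1]
print(select_features(X, y, k=1))  # [0]
```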
Age and gender recognition for telephone applications based on GMM supervectors and support vector machines
This paper compares two approaches to automatic age and gender classification with 7 classes. The first approach uses Gaussian Mixture Models (GMMs) with Universal Background Models (UBMs), which is well known from the task of speaker identification/verification. Training is performed by the EM algorithm or MAP adaptation, respectively. In the second approach, a GMM model is trained for each speaker of the test and training set. The means of each model are extracted and concatenated, which results in a GMM supervector for each speaker. These supervectors are then used in a support vector machine (SVM). Three different kernels were employed for the SVM approach: a polynomial kernel (with different polynomials), an RBF kernel, and a linear GMM distance kernel based on the KL divergence. With the SVM approach we improved the recognition rate to 74 % (p < 0.001) and are in the same range as humans. Index Terms—Acoustic signal analysis, speaker classification, age, gender, Gaussian mixture models (GMM), support vector machine (SVM)
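The supervector construction with the KL-based normalisation can be sketched as below. The toy component counts and dimensions are illustrative, not the paper's actual configuration.

```python
import numpy as np

def kl_supervector(weights, means, variances):
    """Stack the adapted per-component means of a speaker's GMM into one
    long supervector. Each mean is scaled by sqrt(w_c)/sigma_c, so a plain
    dot product between two such vectors acts as the KL-divergence-based
    linear distance kernel mentioned in the abstract.
    weights: (C,), means/variances: (C, D) arrays."""
    w = np.sqrt(np.asarray(weights, dtype=float))[:, None]       # (C, 1)
    scaled = w * np.asarray(means, dtype=float) / np.sqrt(
        np.asarray(variances, dtype=float))                      # (C, D)
    return scaled.reshape(-1)                                    # (C * D,)

# Toy speaker model: a 2-component GMM over 2-dimensional features.
sv = kl_supervector([0.5, 0.5],
                    [[1.0, 2.0], [3.0, 4.0]],
                    [[1.0, 1.0], [1.0, 1.0]])
print(sv.shape)  # (4,)
```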
Multi-class Detection of Pathological Speech with Latent Features: How does it perform on unseen data?
The detection of pathologies from speech features is usually defined as a
binary classification task with one class representing a specific pathology and
the other class representing healthy speech. In this work, we train neural
networks, large margin classifiers, and tree boosting machines to distinguish
between four different pathologies: Parkinson's disease, laryngeal cancer,
cleft lip and palate, and oral squamous cell carcinoma. We demonstrate that
latent representations extracted at different layers of a pre-trained wav2vec
2.0 system can be effectively used to classify these types of pathological
voices. We evaluate the robustness of our classifiers by adding room impulse
responses to the test data and by applying them to unseen speech corpora. Our
approach achieves unweighted average F1-Scores between 74.1% and 96.4%,
depending on the model and the noise conditions used. The systems generalize
and perform well on unseen data of healthy speakers sampled from a variety of
different sources.
Comment: Submitted to ICASSP 202
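Turning the frame-level latent representations into a classifier input can be sketched as mean pooling over time. The pooling strategy and the simulated 768-dimensional latents (the wav2vec 2.0 base hidden size) are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

def utterance_embedding(hidden_states):
    """Mean-pool frame-level latents from one wav2vec 2.0 layer into a
    fixed-size utterance representation for a downstream classifier
    (SVM, tree boosting, or a small neural network).
    hidden_states: (n_frames, hidden_dim) array-like."""
    return np.asarray(hidden_states, dtype=float).mean(axis=0)

# Simulated latents: 50 frames of 768-dim features.
frames = np.random.default_rng(0).normal(size=(50, 768))
print(utterance_embedding(frames).shape)  # (768,)
```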
A Stutter Seldom Comes Alone -- Cross-Corpus Stuttering Detection as a Multi-label Problem
Most stuttering detection and classification research has viewed stuttering
as a multi-class classification problem or a binary detection task for each
dysfluency type; however, this does not match the nature of stuttering, in
which one dysfluency seldom comes alone but rather co-occurs with others. This
paper explores multi-language and cross-corpus end-to-end stuttering detection
as a multi-label problem using a modified wav2vec 2.0 system with an
attention-based classification head and multi-task learning. We evaluate the
method using combinations of three datasets containing English and German
stuttered speech, one containing speech modified by fluency shaping. The
experimental results and an error analysis show that multi-label stuttering
detection systems trained on cross-corpus and multi-language data achieve
competitive results, but performance on samples with multiple labels stays below
overall detection results.
Comment: Accepted for presentation at Interspeech 2023
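The multi-label decision rule that distinguishes this setup from multi-class classification can be sketched as an independent sigmoid per dysfluency type. The label set and logit values below are hypothetical.

```python
import numpy as np

def multilabel_decisions(logits, threshold=0.5):
    """Multi-label decision rule for stuttering detection: an independent
    sigmoid per dysfluency type, so several labels can fire for the same
    segment (e.g. a block co-occurring with a sound repetition), unlike a
    softmax over mutually exclusive classes."""
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    return (probs >= threshold).astype(int).tolist()

# Hypothetical classifier outputs for
# [block, prolongation, sound repetition, word repetition, interjection]:
print(multilabel_decisions([2.1, -0.3, 1.4, -2.0, 0.1]))  # [1, 0, 1, 0, 1]
```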
Classifying Dementia in the Presence of Depression: A Cross-Corpus Study
Automated dementia screening enables early detection and intervention,
reducing costs to healthcare systems and increasing quality of life for those
affected. Depression has shared symptoms with dementia, adding complexity to
diagnoses. The research focus so far has been on binary classification of
dementia (DEM) and healthy controls (HC) using speech from picture description
tests from a single dataset. In this work, we apply established baseline
systems to discriminate cognitive impairment in speech from the semantic Verbal
Fluency Test and the Boston Naming Test using text, audio and emotion
embeddings in a 3-class classification problem (HC vs. MCI vs. DEM). We perform
cross-corpus and mixed-corpus experiments on two independently recorded German
datasets to investigate generalization to larger populations and different
recording conditions. In a detailed error analysis, we look at depression as a
secondary diagnosis to understand what our classifiers actually learn.
Comment: Accepted at INTERSPEECH 202
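The difference between the cross-corpus and mixed-corpus settings can be sketched as split logic. The corpus names and the 50/50 mixed split are placeholders, not the actual experimental partitions.

```python
def make_split(corpus_a, corpus_b, mixed=False):
    """Illustrative split logic: in the cross-corpus setting, train on one
    dataset and test on the other; in the mixed-corpus setting, both
    datasets contribute to train and test."""
    if mixed:
        half_a, half_b = len(corpus_a) // 2, len(corpus_b) // 2
        train = corpus_a[:half_a] + corpus_b[:half_b]
        test = corpus_a[half_a:] + corpus_b[half_b:]
        return train, test
    return list(corpus_a), list(corpus_b)

train, test = make_split(["a1", "a2", "a3", "a4"], ["b1", "b2"], mixed=True)
print(train)  # ['a1', 'a2', 'b1']
```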
Automatic detection of sigmatism in children
In this paper we propose an automatic system to detect sigmatism from the speech signal. Sigmatism occurs when the tongue is positioned incorrectly during the articulation of sibilant phones like /s/ and /z/. For our task we extracted various sets of features from speech: Mel-frequency cepstral coefficients, energies in specific bandwidths of the spectral envelope, and so-called supervectors, which are the parameters of an adapted speaker model. We then trained several classifiers on a speech database of German adults simulating three different types of sigmatism. Recognition results were calculated at the phone, word, and speaker level for both the simulated database and a database of pathological speakers. For the simulated database, we achieved recognition rates of up to 86 %, 87 %, and 94 % at the phone, word, and speaker level, respectively. The best classifier was then integrated into a Java applet that allows patients to record their own speech, either by pronouncing isolated phones, a specific word, or a list of words, and provides them with feedback on whether the sibilant phones are being pronounced correctly.
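The band-energy features can be sketched as summed power-spectrum energy in a frequency band. The band edges and the synthetic test signal are illustrative; misarticulated sibilants typically shift energy in the high-frequency part of the spectrum.

```python
import numpy as np

def band_energy(frame, sample_rate, low_hz, high_hz):
    """Energy of one speech frame inside a frequency band, one of the
    feature types described above for sibilant analysis."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    mask = (freqs >= low_hz) & (freqs < high_hz)
    return float(spectrum[mask].sum())

# Toy frame: a 6 kHz sinusoid sampled at 16 kHz puts its energy into
# the 4-8 kHz band rather than the 0-4 kHz band.
t = np.arange(512) / 16000.0
frame = np.sin(2 * np.pi * 6000.0 * t)
high = band_energy(frame, 16000, 4000, 8000)
low = band_energy(frame, 16000, 0, 4000)
print(high > low)  # True
```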
A survey on perceived speaker traits: personality, likability, pathology, and the first challenge
The INTERSPEECH 2012 Speaker Trait Challenge aimed at a unified test-bed for perceived speaker traits, the first challenge of this kind: personality in the five OCEAN personality dimensions, likability of speakers, and intelligibility of pathologic speakers. In the present article, we give a brief overview of the state-of-the-art in these three fields of research and describe the three sub-challenges in terms of the challenge conditions, the baseline results provided by the organisers, and a new openSMILE feature set, which has been used for computing the baselines and which has been provided to the participants. Furthermore, we summarise the approaches and the results presented by the participants to show the various techniques that are currently applied to solve these classification tasks.